Goto

Collaborating Authors

 Scientific Discovery


Roadkill is a surprising and untapped source for scientists

Popular Science

Millions of animals unfortunately die on roads each year, but the casualties hold important data. Breakthroughs, discoveries, and DIY tips sent six days a week. As much as people try to avoid it (and not contribute to it), the untimely animal deaths are an unfortunate, inevitable byproduct of a society reliant on cars. In Brazil alone, it's estimated that anywhere between two and eight million birds and mammals are killed on roadways every year. In Europe, the potential tally may climb as high as 194 million .


A unified framework for bandit multiple testing

Neural Information Processing Systems

In bandit multiple hypothesis testing, each arm corresponds to a different null hypothesis that we wish to test, and the goal is to design adaptive algorithms that correctly identify large set of interesting arms (true discoveries), while only mistakenly identifying a few uninteresting ones (false discoveries). One common metric in non-bandit multiple testing is the false discovery rate (FDR). We propose a unified, modular framework for bandit FDR control that emphasizes the decoupling of exploration and summarization of evidence. We utilize the powerful martingale-based concept of e-processes to ensure FDR control for arbitrary composite nulls, exploration rules and stopping times in generic problem settings. In particular, valid FDR control holds even if the reward distributions of the arms could be dependent, multiple arms may be queried simultaneously, and multiple (cooperating or competing) agents may be querying arms, covering combinatorial semi-bandit type settings as well. Prior work has considered in great detail the setting where each arm's reward distribution is independent and sub-Gaussian, and a single arm is queried at each step. Our framework recovers matching sample complexity guarantees in this special case, and performs comparably or better in practice. For other settings, sample complexities will depend on the finer details of the problem (composite nulls being tested, exploration algorithm, data dependence structure, stopping rule) and we do not explore these; our contribution is to show that the FDR guarantee is clean and entirely agnostic to these details.


DiscoveryWorld: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

Neural Information Processing Systems

Automated scientific discovery promises to accelerate progress across scientific domains, but evaluating an agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DiscoveryWorld, a virtual environment that enables benchmarking an agent's ability to perform complete cycles of novel scientific discovery in an inexpensive, simulated, multi-modal, long-horizon, and fictional setting.DiscoveryWorld consists of 24 scientific tasks across three levels of difficulty, each with parametric variations that provide new discoveries for agents to make across runs. Tasks require an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. Task difficulties are normed to range from straightforward to challenging for human scientists with advanced degrees. DiscoveryWorld further provides three automatic metrics for evaluating performance, including: (1) binary task completion, (2) fine-grained report cards detailing procedural scoring of task-relevant actions, and (3) the accuracy of discovered explanatory knowledge.While simulated environments such as DiscoveryWorld are low-fidelity compared to the real world, we find that strong baseline agents struggle on most DiscoveryWorld tasks, highlighting the utility of using simulated environments as proxy tasks for near-term development of scientific discovery competency in agents.


'Living rocks' suck up a lot of carbon

Popular Science

Super tough microbialites are some of the oldest evidence of life on Earth. Breakthroughs, discoveries, and DIY tips sent every weekday. Among the tricky carnivorous plants, great white shark-killing orca whales, and other remarkable flora and fauna that call South Africa home is a remarkable group of "living rocks." Called microbialites, these communities are similar to coral reefs and are built up by microbes. These tiny living organisms absorb and release dissolved minerals into more solid rock-like forms.


AI-Newton: A Concept-Driven Physical Law Discovery System without Prior Physical Knowledge

Fang, You-Le, Jian, Dong-Shan, Li, Xiang, Ma, Yan-Qing

arXiv.org Artificial Intelligence

Advances in artificial intelligence (AI) have made AI-driven scientific discovery a highly promising new paradigm [1]. Although AI has achieved remarkable results in tackling domain-specific challenges [2, 3], the ultimate aspiration from a paradigm-shifting perspective still lies in developing reliable AI systems capable of autonomous scientific discovery directly from a large collection of raw data without supervision [4, 5]. Current approaches to automated physics discovery focus on individual experiments, employing either neural network (NN)-based methods [6-25] or symbolic techniques [26-33]. By analyzing data from a single experiment, these methods can construct a specific model capable of predicting future data from the same experiment; if sufficiently simple, such a model may even be expressed in symbolic form [34-36]. Although these methods represent a crucial and successful stage towards automated scientific discovery, they have not yet reached a discovery capacity comparable to that of human physicists.


PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation

Imani, Shima, Moon, Seungwhan, Ahmadyan, Adel, Zhang, Lu, Ahmed, Kirmani, Damavandi, Babak

arXiv.org Artificial Intelligence

Evaluating vision-language models (VLMs) in scientific domains like mathematics and physics poses unique challenges that go far beyond predicting final answers. These domains demand conceptual understanding, symbolic reasoning, and adherence to formal laws, requirements that most existing benchmarks fail to address. In particular, current datasets tend to be static, lacking intermediate reasoning steps, robustness to variations, or mechanisms for verifying scientific correctness. To address these limitations, we introduce PRiSM, a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent, to generate well-structured problem instances. Each problem contains dynamic textual and visual input, a generated figure, alongside rich structured outputs: executable Python code for ground truth generation and verification, and detailed step-by-step reasoning. The dynamic nature and Python-powered automated ground truth generation of our benchmark allow for fine-grained experimental auditing of multimodal VLMs, revealing failure modes, uncertainty behaviors, and limitations in scientific reasoning. To this end, we propose five targeted evaluation tasks covering generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution. Through comprehensive evaluation of existing VLMs, we highlight their limitations and showcase how PRiSM enables deeper insights into their scientific reasoning capabilities.


From Task Executors to Research Partners: Evaluating AI Co-Pilots Through Workflow Integration in Biomedical Research

Weidener, Lukas, Brkić, Marko, Bacci, Chiara, Jovanović, Mihailo, Ulgac, Emre, Dobrin, Alex, Weniger, Johannes, Vlas, Martin, Singh, Ritvik, Meduri, Aakaash

arXiv.org Artificial Intelligence

Artificial intelligence systems are increasingly deployed in biomedical research. However, current evaluation frameworks may inadequately assess their effectiveness as research collaborators. This rapid review examines benchmarking practices for AI systems in preclinical biomedical research. Three major databases and two preprint servers were searched from January 1, 2018 to October 31, 2025, identifying 14 benchmarks that assess AI capabilities in literature understanding, experimental design, and hypothesis generation. The results revealed that all current benchmarks assess isolated component capabilities, including data analysis quality, hypothesis validity, and experimental protocol design. However, authentic research collaboration requires integrated workflows spanning multiple sessions, with contextual memory, adaptive dialogue, and constraint propagation. This gap implies that systems excelling on component benchmarks may fail as practical research co-pilots. A process-oriented evaluation framework is proposed that addresses four critical dimensions absent from current benchmarks: dialogue quality, workflow orchestration, session continuity, and researcher experience. These dimensions are essential for evaluating AI systems as research co-pilots rather than as isolated task executors.


E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

Sadhuka, Shuvom, Prinster, Drew, Fannjiang, Clara, Scalia, Gabriele, Regev, Aviv, Wang, Hanchen

arXiv.org Machine Learning

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used for to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decisions rules with statistical guarantees, enabling the deployment of more reliable agentic systems.


Hypothesis Testing for Generalized Thurstone Models

Makur, Anuran, Singh, Japneet

arXiv.org Machine Learning

In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying \emph{generalized Thurstone model} $\mathcal{T}_F$ for a given choice function $F$. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of minimax hypothesis testing for $\mathcal{T}_F$ models. We formulate this testing problem by introducing a notion of separation distance between general pairwise comparison models and the class of $\mathcal{T}_F$ models. We then derive upper and lower bounds on the critical threshold for testing that depend on the topology of the observation graph. For the special case of complete observation graphs, this threshold scales as $Θ((nk)^{-1/2})$, where $n$ is the number of agents and $k$ is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, establish time-uniform bounds on the probabilities of type I and II errors using reverse martingale techniques, and derive minimax lower bounds using information-theoretic methods. Finally, we validate our results through experiments on synthetic and real-world datasets.


Hierarchical AI-Meteorologist: LLM-Agent System for Multi-Scale and Explainable Weather Forecast Reporting

Sukhorukov, Daniil, Zakharov, Andrei, Glazkov, Nikita, Yanchanka, Katsiaryna, Kirilin, Vladimir, Dubovitsky, Maxim, Sultimov, Roman, Maksimov, Yuri, Makarov, Ilya

arXiv.org Artificial Intelligence

We present the Hierarchical AI-Meteorologist, an LLM-agent system that generates explainable weather reports using a hierarchical forecast reasoning and weather keyword generation. Unlike standard approaches that treat forecasts as flat time series, our framework performs multi-scale reasoning across hourly, 6-hour, and daily aggregations to capture both short-term dynamics and long-term trends. Its core reasoning agent converts structured meteorological inputs into coherent narratives while simultaneously extracting a few keywords effectively summarizing the dominant meteorological events. These keywords serve as semantic anchors for validating consistency, temporal coherence and factual alignment of the generated reports. Using OpenWeather and Meteostat data, we demonstrate that hierarchical context and keyword-based validation substantially improve interpretability and robustness of LLM-generated weather narratives, offering a reproducible framework for semantic evaluation of automated meteorological reporting and advancing agent-based scientific reasoning.